Load data

datos <- read.csv(file = 'data.csv')
datos$goal <- factor(datos$goal)

Descriptive analysis

In order to spot any possible mistakes in our data, it is appropriate to begin with a descriptive analysis. The data structure is shown next in a brief summary.

summary(datos)
##  goal            X               Y        
##  0:19844   Min.   : 45.3   Min.   : 0.70  
##  1: 2829   1st Qu.: 97.5   1st Qu.:32.10  
##            Median :105.2   Median :40.00  
##            Mean   :103.7   Mean   :39.76  
##            3rd Qu.:111.0   3rd Qu.:47.00  
##            Max.   :120.0   Max.   :79.50

The goal variable is a dummy variable that marks a scored goal with a 1 and a missed goal with a 0. In addition, the shot coordinates are known. According to the database, a football field is 120 meters long by 80 meters wide.

Therefore, the X variable (length) shows where the furthest shot (X ≈ 45 m, about 75 m from the goal line) and the nearest shot (X = 120 m, on the goal line itself) were taken. In any case, the average shot is taken from roughly 16 meters away from the goal.

Regarding the width (the Y variable), the average shot is taken from the middle of the pitch. Although there are shots all along the width, they tend to concentrate roughly between 32 and 47 meters from the touchline (the first and third quartiles).

It is also worth noting that there are no missing values.
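The absence of missing values can be verified directly with base R. A quick sketch, shown here on a tiny stand-in frame so the chunk runs on its own (in the report it would be applied to datos):

```r
# Tiny stand-in for datos; in the report: colSums(is.na(datos))
datos_chk <- data.frame(goal = c(0, 1), X = c(100.5, 110.0), Y = c(40.0, 35.5))

# NA count per column; all zeros means no missing values
colSums(is.na(datos_chk))
```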

Next, the variables will be examined in further detail.

boxplot(datos$X, horizontal = TRUE)  # distribution of the length coordinate
legend("topleft", legend = "Variable X")

The box-and-whisker plot of X shows an unbalanced distribution: as noted before, shots are usually taken from the attacking half. However, some cases that could be considered outliers are shot from much further away, which produces the visible tail on the left of the box.

boxplot(datos$Y, horizontal = TRUE)  # distribution of the width coordinate
legend("topleft", legend = "Variable Y")

On the other hand, the Y variable has a distribution much closer to the Gaussian bell, although some outliers remain on both sides of the box. A platykurtic distribution can therefore be assumed.

To observe the relation between both variables, a scatter plot of the shot coordinates will be used.

plot(datos$X, datos$Y)

This picture shows more clearly where the shots are taken. The shape of the graphic resembles a football field, with the goal the players are shooting at on the centre right. It shows that almost every shot is taken from the centre and from less than 20 meters away.

Next, the response variable will be examined with a pie chart.

pie(table(datos$goal), labels = c("Fail", "Goal"), main="Goals pie chart")

The graphic makes visible the gap between the number of shots that end in a goal and the number that end in a miss. Discriminant analysis, the method that will be used to estimate the probabilities, works acceptably when the categories are balanced, so this imbalance needs to be solved. The usual ways to do so are undersampling and oversampling; since there is no access to additional data, undersampling will be used.
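The size of the imbalance can be quantified from the counts in the summary above (19,844 misses against 2,829 goals); a quick sketch:

```r
# Class counts taken from summary(datos) above
counts <- c(Fail = 19844, Goal = 2829)
round(prop.table(counts), 3)  # Fail 0.875, Goal 0.125
```

Only about one shot in eight ends in a goal, which confirms the need to rebalance before fitting.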

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.0     v purrr   0.3.4
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Relabel the outcome and build a duplicated, label-swapped copy of the data.
# The Click indicator preserves the original outcome so that ovun.sample()
# can undersample on it below.
datos$goal <- ifelse(datos$goal == 0, "Miss", "Score")
datos2 <- datos
datos2$goal <- ifelse(datos2$goal == "Score", "Miss", "Score")
datos2$Click <- 0
datos$Click <- 1
datos <- rbind(datos, datos2)

library(ROSE)
## Loaded ROSE 0.0-4
balanced_sample <- NULL

for (c in unique(datos$goal)) {
  # Undersample within each goal label so that Click (the original
  # outcome) ends up balanced at p = 0.5
  tmp_df <- datos %>% filter(goal == c)
  tmp <- ovun.sample(Click ~ ., data = tmp_df, method = "under", p = 0.5, seed = 5)$data
  balanced_sample <- rbind(balanced_sample, tmp)
}

# Drop the helper column and restore the original outcome from Click
shots <- balanced_sample[, -4]
shots$goal <- balanced_sample$Click

The next graphic classifies the shots: those that score in red and those that miss in black.

plot(shots$X, shots$Y, col = factor(shots$goal))
legend("topleft", legend = levels(factor(shots$goal)), pch = 19, col = 1:2)

The shots that score are nearer to the goal line and usually in the centre. Besides, the data show a certain curvature around the centre of the Y variable, which is an argument in favour of quadratic discriminant analysis.

To show that a linear analysis would be unsuitable in this case, the next graphic examines whether the differences between classes are clearly visible or not.
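The code for these histograms does not appear in the document; what follows is a sketch of what it presumably looked like, assuming ggplot2 (the `stat_bin()` messages below are what geom_histogram emits when left at its default `bins = 30`). A small synthetic frame stands in for the balanced shots data so the chunk is self-contained:

```r
library(ggplot2)

set.seed(5)
# Synthetic stand-in for the balanced shots frame (goal: 0 = miss, 1 = goal)
shots_demo <- data.frame(
  goal = rep(c(0, 1), each = 100),
  X = c(rnorm(100, 95, 10), rnorm(100, 108, 6)),
  Y = c(rnorm(100, 40, 12), rnorm(100, 40, 8))
)

# Overlaid histograms of each coordinate, coloured by class
p_x <- ggplot(shots_demo, aes(X, fill = factor(goal))) +
  geom_histogram(alpha = 0.5, position = "identity")
p_y <- ggplot(shots_demo, aes(Y, fill = factor(goal))) +
  geom_histogram(alpha = 0.5, position = "identity")
```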

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These histograms confirm what was already apparent: the distribution of the X variable is not normal, as previously noted.

Besides, the categories are not clearly separated, so a linear analysis would perform very poorly.

Inference

The final objective of this study is to classify shots as goal or miss using the shot coordinates. A discriminant analysis will therefore be fitted to classify the shots from the coordinates.

This requires splitting the data: some cases will be used to train the model, while the rest form the test set on which the results will be predicted. The training sample will contain 80% of the data, and the k-fold method will be used for validation.

A question that may arise is whether the data need to be standardized. Not in this case, because the results obtained would be the same. Besides, quadratic discriminant analysis has the advantage of being fairly robust to outliers. However, when interpreting the results it must be kept in mind that the data do not follow a normal distribution.
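The claim that standardizing would not change the results can be illustrated directly: QDA assigns the same classes whether or not the features are rescaled, because an affine rescaling changes both class densities by the same factor. A minimal sketch with synthetic data, using MASS::qda:

```r
library(MASS)

set.seed(1)
# Two synthetic classes separated along a single feature
d <- data.frame(
  g = factor(rep(0:1, each = 50)),
  X = c(rnorm(50, 0), rnorm(50, 3))
)
m_raw <- qda(g ~ X, data = d)

d_std <- d
d_std$X <- as.numeric(scale(d_std$X))  # standardized copy of the feature
m_std <- qda(g ~ X, data = d_std)

# Class assignments are identical with and without standardization
identical(predict(m_raw)$class, predict(m_std)$class)  # TRUE
```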

So the model is implemented.
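The fitting code itself is omitted from the document, so here is a minimal sketch of that step under stated assumptions: MASS::qda as the model, the trainData, testData and qda_model2 names used in the validation section below, and a plain 80/20 split (the k-fold validation mentioned above would wrap this same fit, e.g. with caret). A synthetic frame stands in for the balanced shots data so the chunk runs on its own:

```r
library(MASS)

set.seed(5)
# Synthetic stand-in for the balanced shots frame (goal: 0 = miss, 1 = goal)
shots <- data.frame(
  goal = rep(c(0, 1), each = 500),
  X = c(rnorm(500, 95, 10), rnorm(500, 110, 5)),
  Y = c(rnorm(500, 40, 15), rnorm(500, 40, 8))
)

# 80% of the cases train the model; the rest are held out for testing
idx <- sample(nrow(shots), size = 0.8 * nrow(shots))
trainData <- shots[idx, ]
testData <- shots[-idx, ]

# Quadratic discriminant analysis on the shot coordinates
qda_model2 <- qda(goal ~ X + Y, data = trainData)
```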

Validation

Next, the model's classification errors will be examined.

predicted <- predict(object = qda_model2, newdata = testData)
table(testData$goal, predicted$class, 
      dnn = c("Real Outcome", "Predict Outcome"))
##             Predict Outcome
## Real Outcome   0   1
##            0 642 522
##            1 179 925

This model has an accuracy of 0.6909171 and a false-positive rate of 0.2788162. Both are acceptable.
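The accuracy figure can be recomputed directly from the confusion matrix above; a quick check:

```r
# Confusion matrix as printed above (rows = real outcome, cols = predicted)
cm <- matrix(c(642, 522,
               179, 925), nrow = 2, byrow = TRUE)
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 7)  # 0.6909171
```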

Model interpretation

To visualize this better, the next graphic shows the probabilities together with the coordinates.

library(plotly)
library(ggplot2)

predictions <- trainData
# Posterior probabilities of each class for the training data
post <- predict(qda_model2)$posterior
predictions$probFail <- post[, 1]
predictions$probGoal <- post[, 2]

graph <- ggplot(predictions, aes(X, Y, size = probGoal, color = probFail)) +
  geom_point()
ggplotly(p = graph)

The model shows that when the shot is taken near the 120-metre line (close to the goal), it is much easier to score. It is also easier when the shot is taken from the centre rather than from near the touchline.

But this model would be of little use if it could not predict new observations, so the next chunk predicts the result for coordinates entered by the user.

# Build a one-row data frame for the new shot coordinates. readline()
# only captures input when R runs interactively.
new <- data.frame(X = NA_real_, Y = NA_real_)
new$X <- as.numeric(readline(prompt = "Enter X: "))
new$Y <- as.numeric(readline(prompt = "Enter Y: "))

predict(object = qda_model2, newdata = new)

When this document is knitted non-interactively, readline() returns an empty string, so both coordinates become NA and predict() warns and returns an NA class with NA posteriors. Running the chunk in an interactive session yields an actual prediction.